Review for NeurIPS paper: WoodFisher: Efficient Second-Order Approximation for Neural Network Compression

Neural Information Processing Systems

Weaknesses: --- Missing details about lambda. While mentioned on line 138, the dampening parameter lambda does not appear in the experimental section of the main body; I only found the value 1e-5 in the appendix (l. 799). How do you select its value? I expect your final algorithm to be very sensitive to lambda, since \delta_L as defined in Eq. 4 selects directions with the smallest curvature. Another comment about lambda: if you set it to a very large value, it becomes dominant compared to the eigenvalues of F, and your technique basically amounts to magnitude pruning. In that regard, magnitude pruning (MP) is just a special case of your technique, obtained with a large dampening value.
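The large-lambda limit the review describes can be checked numerically. The sketch below assumes an OBS-style pruning statistic rho_i = w_i^2 / (2 [(F + lambda I)^{-1}]_{ii}) (one reading of the Eq. 4 the review refers to, with a toy random Fisher matrix, not the paper's actual setup): for a huge lambda, (F + lambda I)^{-1} is approximately (1/lambda) I, so the statistic reduces to (lambda/2) w_i^2 and ranks weights exactly as magnitude pruning does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weights and a random PSD "empirical Fisher" F (stand-ins, not the paper's data).
w = rng.normal(size=5)
A = rng.normal(size=(5, 5))
F = A @ A.T

def pruning_scores(w, F, lam):
    """OBS-style statistic rho_i = w_i^2 / (2 [(F + lam*I)^{-1}]_{ii})."""
    H_inv = np.linalg.inv(F + lam * np.eye(len(w)))
    return w**2 / (2.0 * np.diag(H_inv))

# With a huge dampening lambda, the inverse is ~ (1/lam) I, so the score
# is ~ (lam/2) * w_i^2: the same ordering as magnitude pruning.
scores_large_lam = pruning_scores(w, F, lam=1e8)
magnitude_rank = np.argsort(np.abs(w))
assert np.array_equal(np.argsort(scores_large_lam), magnitude_rank)
```

The assertion passes because squaring is monotone in |w_i|, so once the curvature term is dominated by lambda the two rankings coincide.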


Review for NeurIPS paper: Deep reconstruction of strange attractors from time series

Neural Information Processing Systems

The paper considers the setting in which the observed time series is governed by a dynamical system. However, when the problem is cast into a machine learning setup for general time series analysis, this distinction is sometimes lost. This may be a point to mention in the broader impacts section: in many applications it is not known whether the time series data of interest is governed by a dynamical system. I also have some concerns about the claim of "essentially one governing hyperparameter" (line 316): a. Could the authors provide more evidence that the learning rate should not be considered a hyperparameter? After all, in lines 189-190 the learning rate is listed as a parameter that is tuned.


Conditional Matrix Flows for Gaussian Graphical Models

Neural Information Processing Systems

Studying conditional independence among many variables with few observations is a challenging task. Gaussian Graphical Models (GGMs) tackle this problem by encouraging sparsity in the precision matrix through l_q regularization with q \leq 1. However, most GGMs rely on the l_1 norm because the objective is highly non-convex for sub-l_1 pseudo-norms. In the frequentist formulation, the l_1 norm relaxation provides the solution path as a function of the shrinkage parameter \lambda. In the Bayesian formulation, sparsity is instead encouraged through a Laplace prior, but posterior inference for different \lambda requires repeated runs of expensive Gibbs samplers. Here we propose a general framework for variational inference with matrix-variate Normalizing Flows in GGMs, which unifies the benefits of the frequentist and Bayesian frameworks. As a key improvement on previous work, we train with one flow a continuum of sparse regression models jointly for all regularization parameters \lambda and all l_q norms, including non-convex sub-l_1 pseudo-norms. Within one model we thus have access to (i) the evolution of the posterior for any \lambda and any l_q (pseudo-)norm, (ii) the marginal log-likelihood for model selection, and (iii) the frequentist solution paths through simulated annealing in the MAP limit.


Reviews: Learning A Structured Optimal Bipartite Graph for Co-Clustering

Neural Information Processing Systems

The authors propose a new method for co-clustering. The idea is to learn a bipartite graph with exactly k connected components. This way, the clusters can be inferred directly and no further post-processing step (such as running k-means) is necessary. After introducing their approach, the authors conduct experiments on a synthetic data set as well as on four benchmark data sets. I think the proposed approach is interesting. However, there are some issues.
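The reviewer's point that clusters can be read off directly can be illustrated with a toy example: once the bipartite graph has exactly k connected components, each component is one co-cluster of rows and columns. A pure-Python BFS sketch (hypothetical affinity matrix B, not the paper's learned graph):

```python
from collections import deque

# Hypothetical bipartite affinity between 3 rows and 3 columns, with zero
# blocks arranged so the graph has exactly k = 2 connected components.
B = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 1]]
n, m = len(B), len(B[0])

# Adjacency lists over n + m nodes: rows are 0..n-1, columns are n..n+m-1.
adj = {v: [] for v in range(n + m)}
for i in range(n):
    for j in range(m):
        if B[i][j] != 0:
            adj[i].append(n + j)
            adj[n + j].append(i)

# Label each node with its connected component via BFS.
labels, comp = {}, 0
for start in range(n + m):
    if start in labels:
        continue
    labels[start] = comp
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in labels:
                labels[u] = comp
                queue.append(u)
    comp += 1

row_clusters = [labels[i] for i in range(n)]
col_clusters = [labels[n + j] for j in range(m)]
# comp == 2: rows {0, 1} with columns {0, 1} form one co-cluster,
# row 2 with column 2 the other -- no k-means step needed.
```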


Reviews: Synaptic Strength For Convolutional Neural Network

Neural Information Processing Systems

Content: This submission introduces a new framework for compressing neural networks. The main concept is to define the "synaptic strength" of the connection between an input layer and an output feature to be the product of the norm of the corresponding kernel of the input layer and the norm of the input layer. A large synaptic strength indicates that a certain input feature plays a substantial role in computing the output feature. The synaptic strength is incorporated into the training procedure by means of an additional penalty term that encourages a sparse distribution of synaptic strengths. After training, all connections with synaptic strength smaller than some threshold are fixed to zero and the network is fine-tuned.
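One possible reading of this pipeline, sketched with hypothetical names, a random conv weight tensor, and a per-input-channel scale standing in for the "norm of the input" factor (not the submission's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical conv weights: (out_channels, in_channels, kH, kW), plus a
# positive per-input-channel scale acting as the input-norm factor.
W = rng.normal(size=(4, 3, 3, 3))
input_scale = np.abs(rng.normal(size=3))

# Synaptic strength of the (input i -> output o) connection: product of the
# kernel-slice norm and the input-channel scale (one reading of the review's
# description, not the paper's exact definition).
kernel_norms = np.linalg.norm(W.reshape(4, 3, -1), axis=2)  # shape (4, 3)
strength = kernel_norms * input_scale[None, :]

# Sparsity penalty added to the training loss, and post-training pruning:
# connections below a threshold are fixed to zero before fine-tuning.
penalty = 1e-4 * strength.sum()                   # L1-style sparsity term
mask = strength >= np.quantile(strength, 0.5)     # keep the stronger half
W_pruned = W * mask[:, :, None, None]             # zero out weak connections
```

The threshold here is a simple median cut for illustration; in practice it would be chosen to hit a target sparsity before the fine-tuning stage the review mentions.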